TADA Module 1: Water Quality Portal Data Discovery and Cleaning
TADA Team
2024-03-11
Source:vignettes/TADAModule1.Rmd
TADAModule1.RmdOverview
This vignette walks through how to use the TADA R Package to discover and clean (i.e., wrangle, Quality Assure and Quality Control (QAQC), and harmonize) Water Quality Portal (WQP) data from multiple organizations.
Install and load packages
First, install and load the remotes package specifying the repo. This is needed before installing TADA because it is only available on GitHub (not CRAN).
install.packages("remotes",
repos = "http://cran.us.r-project.org"
)
library(remotes)Next, install and load TADA using the remotes package. USGS’s dataRetrieval and other TADA R Package dependencies will also be downloaded automatically from CRAN with the TADA install. If desired, the development version of dataRetrieval can be downloaded directly from GitHub (un-comment).
remotes::install_github("USEPA/TADA",
ref = "develop",
dependencies = TRUE
)
# remotes::install_github("USGS-R/dataRetrieval", dependencies=TRUE)
# if you experience any issues installing TADA, try un-commenting and running
# the line below before the install
# options(download.file.method = "wininet")Finally, use the library() function to load the TADA R Package into your R session.
Retrieve WQP data
WQP data is retrieved and processed for compatibility with TADA. This function, TADA_DataRetrieval, builds on USGS’s dataRetrieval R package functions. It joins three WQP profiles: Site, Sample Results (physical/chemical metadata), and Project. In addition, it changes all data in the Characteristic, Speciation, Fraction, and Unit fields to uppercase, removes exact duplicates, and addresses result values that include special characters.
This function uses the same inputs as the dataRetrieval
readWQPdata function. readWQPdata does not
restrict the characteristics pulled from Water Quality Portal
(WQP).
Data retrieval filters include:
startDate
endDate
characteristicName
sampleMedia
siteType
statecode (review list of possible state and territory abbreviations)
countycode
siteid
organization
project
huc
characteristicType
The default TADA_DataRetrieval function automatically runs the TADA_AutoClean function. In this example, we will set TADA_AutoClean = FALSE and run it as a separate step in the work flow.
Tips:
-
All the query filters for the WQP work as an AND but within the fields there are ORs. For example:
Characteristics: If you choose pH & DO - it’s an OR. This means you will retrieve both pH OR DO data if available.
States: Similarly, if you choose VA and IL, it’s an OR. This means you will retrieve both VA OR IL data if available.
Combinations of fields are ANDs, such as State/VA AND Characteristic/DO”. This means you will receive all DO data available in VA.
“Characteristic” and “Characteristic Type” also work as an AND. This means that the Characteristic must fall within the CharacteristicGroup if both filters are being used, if not you will get an error.
The “siteid” is a general term WQP uses to describe both Site IDs from USGS databases and Monitoring Location Identifiers (from WQX). Each monitoring location in the Water Quality Portal (WQP) has a unique Monitoring Location Identifier, regardless of the database from which it derives. The Monitoring Location Identifier from the WQP is the concatenated Organization Identifier plus the Site ID number. Site IDs that only include a number are only unique identifiers for monitoring locations within USGS NWIS or EPA’s WQX databases separately.
Additional resources:
Review function documentation by entering the following code into the console: ?TADA_DataRetrieval
dataRetrieval
Uncomment below if you would like to review differences between the profiles you would get using readWQPdata vs. TADA_DataRetrieval. The profiles are different because TADA_DataRetrieval automatically joins in data from multiple WQP profiles, and does some additional data cleaning as part of the data retrieval process.
This example includes monitoring data collected from Jan 2018 to Jan 2019 by six organizations: 1) Red Lake Band of Chippewa Indians, 2) Sac & Fox Nation, 3) Pueblo of Pojoaque, 4) Minnesota Chippewa Tribe (Fond du Lac Band), 5) Pueblo of Tesuque, and 6) The Chickasaw Nation
We will move forward with this example in the remainder of the vignette.
# dataRetrieval_example <- dataRetrieval::readWQPdata(organization = c("REDLAKE_WQX", "SFNOES_WQX", "PUEBLO_POJOAQUE", "FONDULAC_WQX", "PUEBLOOFTESUQUE", "CNENVSER"), startDate = "2018-01-01", endDate = "2019-01-01", ignore_attributes = TRUE)Use the code below to download data from the WQP using TADA_DataRetrieval. Edit the code chuck below to define your own WQP query inputs.
Downloads using TADA_DataRetrieval will have the same columns each time, but be aware that data are uploaded to the Water Quality Portal by individual organizations, which may or may not follow the same conventions. Data and metadata quality are not guaranteed! Make sure to carefully explore any data and make conservative quality assurance decisions where information is limited.
Note: TADA_DataRetrieval (by leveraging dataRetrieval), automatically converts the date times to UTC. It also automatically converts the data to dates, datetimes, and numerics based on a standard algorithm.
Enter ?TADA_DataRetrieval into the console to review more example queries and additional information.
TADAProfile <- TADA_DataRetrieval(organization = c("REDLAKE_WQX", "SFNOES_WQX", "PUEBLO_POJOAQUE", "FONDULAC_WQX", "PUEBLOOFTESUQUE", "CNENVSER"), startDate = "2018-01-01", endDate = "2019-01-01", applyautoclean = FALSE)
#> [1] "Downloading WQP query results. This may take some time depending upon the query size."
#> NEWS: USGS data availability and format are changing.
#> Beginning in mid-March 2024 the data obtained from legacy profiles
#> will not include new USGS data or recent updates to existing data.
#> To view the status of changes in data availability and code functionality, visit:
#> https://doi-usgs.github.io/dataRetrieval/articles/Status.html
#> If you have additional questions about these changes,
#> email CompTools@usgs.gov.
#> NEWS: USGS data availability and format are changing.
#> Beginning in mid-March 2024 the data obtained from legacy profiles
#> will not include new USGS data or recent updates to existing data.
#> To view the status of changes in data availability and code functionality, visit:
#> https://doi-usgs.github.io/dataRetrieval/articles/Status.html
#> If you have additional questions about these changes,
#> email CompTools@usgs.gov.If you need to download a large amount of data from across a large area, and the TADA_DataRetrieval function is not working due to WQP timeout issues, then the TADA_BigDataRetrieval function may work better.
This function does multiple synchronous data calls to the WQP (waterqualitydata.us). It uses the WQP summary service to limit the sites downloaded to only those with relevant data. It pulls back data from set number of stations at a time and then joins the data back together to produce a single TADA compatible dataframe as the output.
See ?TADA_BigDataRetrieval for more details. WARNING, some of the examples below can take multiple HOURS to run. The total run time depends on your query inputs.
# AK_AL_WaterTemp <- TADA_BigDataRetrieval(startDate = "2000-01-01", endDate = "2022-12-31", characteristicName = "Temperature, water", statecode = c("AK","AL"))
# AllWaterTemp <- TADA_BigDataRetrieval(characteristicName = "Temperature, water")
# AllPhosphorus <- TADA_BigDataRetrieval(characteristicName = "Phosphorus")
# AllCT <- TADA_BigDataRetrieval(statecode = "CT")Some TADA users are interested in using WQP data for surface water only or for analysis of some non-water data. The TADA_AnalysisDataFilter function can assist in identifying results of interest. Multiple columns are used to identify groundwater results as different organizations may populate different combinations of fields in order to identify a result as groundwater.
This function identifies surface water, groundwater, and sediment results. Users can specify whether all results should be returned with a new column, TADA.UseForAnalysis.Flag,identifying if the result should be included in further analysis or if only results that should be in included are returned.
The defaults are to include surface water, exclude groundwater and sediment, and to return only the results that should be used for analysis (clean = TRUE). This is shown in the active example below. If you would like to see all results with the TADA.UseForAnalysis.Flag column, you can uncomment the example where clean = FALSE.
If you are not interested in using TADA_AnalysisDataFilter, but would like to filter by activity media, uncomment the example to filter for water data only by using dplyr::filter() with TADA.ActivityMediaName.
# Filter to retain only results for use in analysis
TADAProfile <- TADA_AnalysisDataFilter(TADAProfile,
clean = TRUE,
surface_water = TRUE,
ground_water = FALSE,
sediment = FALSE)
#> [1] "TADA_AnalysisDataFilter: Identifying groundwater results."
#> [1] "TADA_AnalysisDataFilter: Flagging surface water results to include in assessments."
#> [1] "TADA_AnalysisDataFilter: Flagging groundwater results to exclude from assessments."
#> [1] "TADA_AnalysisDataFilter: Flagging sediment results to exclude from assessments."
#> [1] "TADA_AnalysisDataFilter: Removing results flagged for exclusion from assessments."
# Add TADA.UseForAnalysis.Flag column to identify which results should be used for analysis
# TADAProfile <- TADA_AnalysisDataFilter(TADAProfile, clean = FALSE)
# Remove data for non-water media types, alternate workflow without using TADA_AnalysisDataFilter()
# TADAProfile <- dplyr::filter(TADAProfile, TADA.ActivityMediaName == "WATER")AutoClean
Now TADA_AutoClean can be run on a smaller dataset after unnecessary results have been removed. It performs the following functions on the data retrieved from the WQP:
TADA_ConvertSpecialChars - converts result value columns to numeric and flags non-numeric values that could not be converted.
TADA_ConvertResultUnits - unifies result units for easier quality control and review
TADA_ConvertDepthUnits - converts depth units to a consistent unit (meters).
TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.
Other helpful actions - converts important text columns to all upper-case letters, removes exact duplicates, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix “TADA.” The following columns are numeric versions of their WQP origins:
- TADA.ResultMeasureValue
- TADA.DetectionQuantitationLimitMeasure.MeasureValue
- TADA.LatitudeMeasure
- TADA.LongitudeMeasure
These functions also add the columns TADA.ResultMeasureValueDataTypes.Flag and TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag, which provide information about the result values that is needed to address censored data later on (i.e., nondetections). Specifically, these new columns flag if special characters are included in result values, and specifies what the special characters are.
# run TADA_AutoClean on filtered dataset to convert special characters, result units, and depth units and identify censored data.
TADAProfile <- TADA_AutoClean(TADAProfile)
#> [1] "TADA_Autoclean: creating TADA-specific columns."
#> [1] "TADA_Autoclean: checking for special characters."
#> [1] "TADA_Autoclean: updating deprecated (i.e. retired) characteristic names."
#> [1] "8 results in your dataset have one of the following deprecated characteristic names: Inorganic nitrogen (nitrate and nitrite) ***retired***use Nitrate + Nitrite. These names have been substituted with the updated preferred names in the TADA.CharacteristicName field."
#> [1] "TADA_Autoclean: harmonizing result and depth units."
#> [1] "NOTE: This version of the TADA package is designed to work with quantitative (numeric) data with media name: 'WATER'. TADA_AutoClean does not currently remove (filter) data with non-water media types. If desired, the user must make this specification on their own outside of package functions. Example: dplyr::filter(.data, TADA.ActivityMediaName == 'WATER')"Review all column names in the TADA Profile to familiarize yourself with the dataset after TADA_AutoClean has added additional TADA prefixed columns. TADA_SummarizeColumn summarizes the data set based on the user specified column and returns a dataframe displaying the number of sites and number of records for each unique value in the specified column. The example below uses TADA.CharacteristicName.
# View column names for TADAProfile
colnames(TADAProfile)
#> [1] "ActivityTypeCode"
#> [2] "ActivityMediaName"
#> [3] "TADA.ActivityMediaName"
#> [4] "ActivityMediaSubdivisionName"
#> [5] "ResultSampleFractionText"
#> [6] "TADA.ResultSampleFractionText"
#> [7] "CharacteristicName"
#> [8] "TADA.CharacteristicName"
#> [9] "SubjectTaxonomicName"
#> [10] "SampleTissueAnatomyName"
#> [11] "MethodSpeciationName"
#> [12] "TADA.MethodSpeciationName"
#> [13] "TADA.ComparableDataIdentifier"
#> [14] "ActivityStartDate"
#> [15] "ActivityStartTime.Time"
#> [16] "ActivityStartTime.TimeZoneCode"
#> [17] "ActivityStartDateTime"
#> [18] "ResultTimeBasisText"
#> [19] "ResultMeasureValue"
#> [20] "TADA.ResultMeasureValue"
#> [21] "TADA.ResultMeasureValueDataTypes.Flag"
#> [22] "ResultValueTypeName"
#> [23] "ResultMeasure.MeasureUnitCode"
#> [24] "TADA.ResultMeasure.MeasureUnitCode"
#> [25] "TADA.WQXResultUnitConversion"
#> [26] "ResultDetectionConditionText"
#> [27] "DetectionQuantitationLimitTypeName"
#> [28] "DetectionQuantitationLimitMeasure.MeasureValue"
#> [29] "TADA.DetectionQuantitationLimitMeasure.MeasureValue"
#> [30] "TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag"
#> [31] "DetectionQuantitationLimitMeasure.MeasureUnitCode"
#> [32] "TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode"
#> [33] "ResultDepthHeightMeasure.MeasureValue"
#> [34] "TADA.ResultDepthHeightMeasure.MeasureValue"
#> [35] "TADA.ResultDepthHeightMeasure.MeasureValueDataTypes.Flag"
#> [36] "ResultDepthHeightMeasure.MeasureUnitCode"
#> [37] "TADA.ResultDepthHeightMeasure.MeasureUnitCode"
#> [38] "ResultDepthAltitudeReferencePointText"
#> [39] "ActivityRelativeDepthName"
#> [40] "ActivityDepthHeightMeasure.MeasureValue"
#> [41] "TADA.ActivityDepthHeightMeasure.MeasureValue"
#> [42] "TADA.ActivityDepthHeightMeasure.MeasureValueDataTypes.Flag"
#> [43] "ActivityDepthHeightMeasure.MeasureUnitCode"
#> [44] "TADA.ActivityDepthHeightMeasure.MeasureUnitCode"
#> [45] "ActivityTopDepthHeightMeasure.MeasureValue"
#> [46] "TADA.ActivityTopDepthHeightMeasure.MeasureValue"
#> [47] "TADA.ActivityTopDepthHeightMeasure.MeasureValueDataTypes.Flag"
#> [48] "ActivityTopDepthHeightMeasure.MeasureUnitCode"
#> [49] "TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode"
#> [50] "ActivityBottomDepthHeightMeasure.MeasureValue"
#> [51] "TADA.ActivityBottomDepthHeightMeasure.MeasureValue"
#> [52] "TADA.ActivityBottomDepthHeightMeasure.MeasureValueDataTypes.Flag"
#> [53] "ActivityBottomDepthHeightMeasure.MeasureUnitCode"
#> [54] "TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode"
#> [55] "StatisticalBaseCode"
#> [56] "ResultFileUrl"
#> [57] "ResultAnalyticalMethod.MethodName"
#> [58] "ResultAnalyticalMethod.MethodDescriptionText"
#> [59] "ResultAnalyticalMethod.MethodIdentifier"
#> [60] "ResultAnalyticalMethod.MethodIdentifierContext"
#> [61] "ResultAnalyticalMethod.MethodUrl"
#> [62] "SampleCollectionMethod.MethodIdentifier"
#> [63] "SampleCollectionMethod.MethodIdentifierContext"
#> [64] "SampleCollectionMethod.MethodName"
#> [65] "SampleCollectionMethod.MethodDescriptionText"
#> [66] "SampleCollectionEquipmentName"
#> [67] "MeasureQualifierCode"
#> [68] "ResultCommentText"
#> [69] "ActivityCommentText"
#> [70] "HydrologicCondition"
#> [71] "HydrologicEvent"
#> [72] "DataQuality.PrecisionValue"
#> [73] "DataQuality.BiasValue"
#> [74] "DataQuality.ConfidenceIntervalValue"
#> [75] "DataQuality.UpperConfidenceLimitValue"
#> [76] "DataQuality.LowerConfidenceLimitValue"
#> [77] "SamplingDesignTypeCode"
#> [78] "LaboratoryName"
#> [79] "ResultLaboratoryCommentText"
#> [80] "ResultIdentifier"
#> [81] "ActivityIdentifier"
#> [82] "OrganizationIdentifier"
#> [83] "OrganizationFormalName"
#> [84] "ProjectName"
#> [85] "ProjectDescriptionText"
#> [86] "ProjectIdentifier"
#> [87] "ProjectFileUrl"
#> [88] "QAPPApprovedIndicator"
#> [89] "QAPPApprovalAgencyName"
#> [90] "CountryCode"
#> [91] "StateCode"
#> [92] "CountyCode"
#> [93] "MonitoringLocationName"
#> [94] "MonitoringLocationTypeName"
#> [95] "MonitoringLocationDescriptionText"
#> [96] "LatitudeMeasure"
#> [97] "TADA.LatitudeMeasure"
#> [98] "LongitudeMeasure"
#> [99] "TADA.LongitudeMeasure"
#> [100] "HUCEightDigitCode"
#> [101] "MonitoringLocationIdentifier"
#> [102] "AquiferName"
#> [103] "AquiferTypeName"
#> [104] "LocalAqfrName"
#> [105] "ConstructionDateText"
#> [106] "WellDepthMeasure.MeasureValue"
#> [107] "WellDepthMeasure.MeasureUnitCode"
#> [108] "WellHoleDepthMeasure.MeasureValue"
#> [109] "WellHoleDepthMeasure.MeasureUnitCode"
#> [110] "ActivityDepthAltitudeReferencePointText"
#> [111] "ActivityEndDate"
#> [112] "ActivityEndTime.Time"
#> [113] "ActivityEndTime.TimeZoneCode"
#> [114] "ActivityEndDateTime"
#> [115] "ActivityConductingOrganizationText"
#> [116] "SampleAquifer"
#> [117] "ActivityLocation.LatitudeMeasure"
#> [118] "ActivityLocation.LongitudeMeasure"
#> [119] "ResultStatusIdentifier"
#> [120] "ResultWeightBasisText"
#> [121] "ResultTemperatureBasisText"
#> [122] "ResultParticleSizeBasisText"
#> [123] "USGSPCode"
#> [124] "BinaryObjectFileName"
#> [125] "BinaryObjectFileTypeCode"
#> [126] "AnalysisStartDate"
#> [127] "ResultDetectionQuantitationLimitUrl"
#> [128] "LabSamplePreparationUrl"
#> [129] "timeZoneStart"
#> [130] "timeZoneEnd"
#> [131] "SourceMapScaleNumeric"
#> [132] "HorizontalAccuracyMeasure.MeasureValue"
#> [133] "HorizontalAccuracyMeasure.MeasureUnitCode"
#> [134] "HorizontalCollectionMethodName"
#> [135] "HorizontalCoordinateReferenceSystemDatumName"
#> [136] "VerticalMeasure.MeasureValue"
#> [137] "VerticalMeasure.MeasureUnitCode"
#> [138] "VerticalAccuracyMeasure.MeasureValue"
#> [139] "VerticalAccuracyMeasure.MeasureUnitCode"
#> [140] "VerticalCollectionMethodName"
#> [141] "VerticalCoordinateReferenceSystemDatumName"
#> [142] "FormationTypeText"
#> [143] "ProjectMonitoringLocationWeightingUrl"
#> [144] "DrainageAreaMeasure.MeasureValue"
#> [145] "DrainageAreaMeasure.MeasureUnitCode"
#> [146] "ContributingDrainageAreaMeasure.MeasureValue"
#> [147] "ContributingDrainageAreaMeasure.MeasureUnitCode"
#> [148] "ProviderName"
#> [149] "LastUpdated"
# Review the number of sites and number of records for each CharacteristicName in TADAProfile
TADAProfile_CharSummary <- TADA_SummarizeColumn(TADAProfile, "TADA.CharacteristicName")
# View TADAProfile_CharSummary
TADAProfile_CharSummary
#> # A tibble: 104 × 3
#> TADA.CharacteristicName n_sites n_records
#> <chr> <int> <int>
#> 1 .ALPHA.-ENDOSULFAN 6 7
#> 2 .BETA.-ENDOSULFAN 6 7
#> 3 ALDRIN 6 7
#> 4 ALKALINITY, TOTAL 126 688
#> 5 ALPHA PARTICLE 6 14
#> 6 ALUMINUM 6 7
#> 7 AMMONIA-NITROGEN 83 328
#> 8 AMMONIUM 16 93
#> 9 APPARENT COLOR 39 106
#> 10 ARSENIC 37 38
#> # ℹ 94 more rowsInvalid coordinates
Review station locations and summary information using the TADA_OverviewMap function. TADA_OverviewMap counts the number of unique results, characteristics, and organizations at each monitoring location in the dataset and creates a tidy map for reviewing summary stats spatially. Larger point sizes indicate more results collected at a given site, while darker blue colors indicate more unique characteristics collected at the site. Users may click on a site to view a pop-up with this summary information, including the number of organizations that reported results at that site. This map may inform a user’s decision to remove/correct sites that are outside the US.
TADA_OverviewMap(TADAProfile)